Both linear regression and k-nearest neighbors (KNN) are popular machine learning models used to make predictions. Linear regression predicts results that are continuous scalar values. It uses a linear predictor function, the best-fit line, to predict the result from known attribute values, and the best-fit line is the one that minimizes the residual error. K-nearest neighbors assumes that a new data point should be classified into the group whose members have the closest attribute values. Thus, a prediction problem whose results are continuous values is easier to model with linear regression, while a problem whose job is to determine whether an object or data point belongs to a certain group is easier to model with k-nearest neighbors.
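
To make the contrast concrete, here is a minimal sketch of both ideas using only NumPy. The toy numbers, the group labels, and the helper name `knn_predict` are made up for illustration and are not part of the bridge data.

```python
from collections import Counter

import numpy as np

# --- Linear regression: fit a line that minimizes the squared residual error.
# Hypothetical toy data lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Design matrix: one column for the attribute, one column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print("slope:", slope, "intercept:", intercept)

# --- k-nearest neighbors: classify a point by majority vote of its k closest neighbors.
# Hypothetical labeled points forming two well-separated groups.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels = np.array(["low", "low", "low", "high", "high", "high"])


def knn_predict(query, k=3):
    """Return the most common label among the k training points closest to query."""
    distances = np.linalg.norm(points - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return Counter(labels[nearest]).most_common(1)[0][0]


print(knn_predict(np.array([1.1, 1.0])))
print(knn_predict(np.array([4.9, 5.0])))
```

The regression half returns a number on a continuous scale, while the KNN half returns a group label, which is the distinction drawn above.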

For the problem of my choice, I want to predict the bike load on the Manhattan Bridge based on the temperature and precipitation. The bike loads on the bridge are continuous scalar values, so I chose to use linear regression.

Procedures

In this prediction, I use the linear model from the Python scikit-learn library. To use the linear model, I need to:

  1. Define my features and write them into a matrix or DataFrame. The rows are the instances and the columns are the different attributes.
  2. Define my result vector.
  3. Separate the instances into training and testing sets.
  4. Apply the linear model.
  5. Calculate the error.
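
The five steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with a made-up linear relationship, not read from the bicycle database), and it uses scikit-learn's `train_test_split` for a random split, which is a swapped-in alternative to the alternating-row split used in the cells below.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the bridge data): 200 hypothetical days.
rng = np.random.default_rng(0)
hi_temp = rng.uniform(40, 90, size=200)
lo_temp = hi_temp - rng.uniform(5, 20, size=200)
precip = rng.uniform(0, 1, size=200)
noise = rng.normal(0, 50, size=200)

# 1. Features: rows are instances, columns are attributes.
X = pd.DataFrame({"HiTemp": hi_temp, "LoTemp": lo_temp, "Precip": precip})

# 2. Result vector (made-up linear relationship plus noise).
y = pd.Series(1000 + 60 * hi_temp - 2000 * precip + noise, name="Count")

# 3. Split the instances into training and testing sets (half each).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# 4. Fit the linear model on the training set and predict the testing set.
model = LinearRegression().fit(X_train, y_train)
predicted = model.predict(X_test)

# 5. Calculate the error on the held-out testing set.
mae = mean_absolute_error(y_test, predicted)
print(f"mean absolute error: {mae:.1f}")
```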

In [137]:
import math
import sqlite3
from pprint import pprint

import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Read the features and the result column from the SQLite database.
conn = sqlite3.connect('bicycle.db')
c = conn.cursor()
c.execute('SELECT HiTemp, LoTemp, Precip FROM bicycle')
features = c.fetchall()
c.execute('SELECT Manhattan FROM bicycle')
results = c.fetchall()

Step 1


In [138]:
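# feature matrix: one row per instance, one column per attribute
X=DataFrame(features, columns=['HiTemp', 'LoTemp', 'Precip'])
# result vector: bike count on the Manhattan Bridge
y=DataFrame(results, columns=['Man_Count'])
#training set: even-numbered rows
X_tr=X[::2]
y_tr=y[::2]
#testing set: odd-numbered rows
X_ts=X[1::2]
y_ts=y[1::2]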

Step 2


In [139]:
model=LinearRegression()
model.fit(X_tr,y_tr)
prd_y=model.predict(X_ts)

Step 3


In [140]:
def getMeanAbsErr(predicted, actual):
    n = len(predicted)
    totalerr = 0
    for i in range(n):
        totalerr = totalerr + abs(predicted[i]-actual[i])
    return (totalerr/n)
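
As an aside, the same mean absolute error can be computed without an explicit loop. The sketch below is a hypothetical vectorized variant (the name `mean_abs_err` is not from the notebook). It flattens both inputs first, so it should also accept the column-shaped array that `model.predict` returns when the model was fitted on a one-column DataFrame. scikit-learn's `sklearn.metrics.mean_absolute_error` is a ready-made alternative.

```python
import numpy as np


def mean_abs_err(predicted, actual):
    """Mean of |predicted - actual| over all instances."""
    # Flatten to 1-D float arrays so lists, Series, and (n, 1) arrays all work.
    predicted = np.asarray(predicted, dtype=float).ravel()
    actual = np.asarray(actual, dtype=float).ravel()
    return float(np.mean(np.abs(predicted - actual)))


# Toy check: the absolute errors are 1, 0, and 2, so the mean is 1.0.
print(mean_abs_err([1, 2, 3], [2, 2, 5]))
```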

Step 4


In [141]:
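# actual counts from the testing set, re-indexed from 0 so they line up with the predictions
d1=DataFrame(y_ts['Man_Count'].values, columns=['Actual'])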
d2=DataFrame(prd_y, columns=['Predicted'])
cmp=d1.join(d2)
print(cmp)


    Actual    Predicted
0     1646  3457.933791
1     1067  2110.539456
2     3329  3724.644889
3     3455  3281.825021
4     2387  4069.368024
5     2178  3671.953886
6     5309  4782.952693
7     4316  5148.943760
8     6823  6324.290449
9     6574  5132.677470
10    3276  4386.691010
11    5978  4854.506845
12    5606  4685.120205
13    4178  4035.984166
14    1525  3103.134521
15    1986  1901.946546
16    4196  4184.266974
17    3157  4187.581109
18    6591  5312.824212
19    7216  5580.234735
20    3072  3657.211004
21    6214  5139.366619
22    2864  4604.958870
23    6843  5621.189662
24    7869  6412.919824
25    5670  5836.998149
26    2750   553.970944
27    7565  5610.061956
28    3708  4614.962155
29    1593  1996.559029
..     ...          ...
68    5904  6030.097045
69    5340  5853.264438
70    7357  4946.450356
71    6248  5915.380728
72    4470  6199.301051
73    6433  5444.172085
74    9152  5873.784315
75    7598  5109.026091
76    5210  5478.896716
77    2260  2696.422344
78    7411  5891.546716
79    6702  6275.236095
80    5013  5070.561180
81    5942  4464.149461
82    4728  4310.239176
83    2440  4255.507975
84    5729  4969.223161
85    6906  4766.120980
86    6209  5414.344864
87    1173  2462.891327
88    6201  4773.506069
89    5477  4530.303587
90    4057  5070.561180
91    7594  5690.506070
92    5588  4814.406543
93    1254  2636.781079
94    5224  3643.130807
95    1558   -40.966142
96    3160  5109.539398
97    4876  4013.001432

[98 rows x 2 columns]

Step 5


In [142]:
#print(getMeanAbsErr(prd_y, y_ts))
#print(cmp['Actual'])
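# coefficient of determination (R^2) on the testing set
score=model.score(X_ts, y_ts)
print(score)
# mean absolute error, in the same units as the bike counts
err=getMeanAbsErr(cmp['Predicted'], cmp['Actual'])
print(err)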
conn.close()


0.409284953232
1186.078914
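
The first number printed above is the value of `model.score`, which for scikit-learn's `LinearRegression` is the coefficient of determination R^2. The second is the mean absolute error in bikes. A score of about 0.41 means the three weather features account for roughly 41% of the variance in the testing-set counts. As a rough sketch of what that score measures, the hypothetical helper `r_squared` below computes it by hand on toy values that are not taken from the bridge data.

```python
import numpy as np


def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    ss_res = np.sum((actual - predicted) ** 2)      # error left after the model
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # error of always guessing the mean
    return 1.0 - ss_res / ss_tot


actual = [2.0, 4.0, 6.0, 8.0]

# Perfect predictions leave no residual error.
perfect = r_squared(actual, actual)

# Always predicting the mean (5.0) is the baseline that R^2 compares against.
baseline = r_squared(actual, [5.0, 5.0, 5.0, 5.0])

print(perfect, baseline)
```

Under this definition, a model that scores below the baseline would get a negative R^2, so the score is not bounded below by zero.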